ABSTRACT

As an essential operation in data cleaning, the similarity join 
has attracted considerable attention from the database com- 
munity. In this paper, we study string similarity joins with 
edit-distance constraints, which find similar string pairs from 
two large sets of strings whose edit distance is within a given 
threshold. Existing algorithms are efficient either for short 
strings or for long strings, and there is no algorithm that 
can efficiently and adaptively support both short strings 
and long strings. To address this problem, we propose a 
partition-based method called Pass-Join. Pass-Join par- 
titions a string into a set of segments and creates inverted 
indices for the segments. Then for each string, Pass-Join 
selects some of its substrings and uses the selected substrings 
to find candidate pairs using the inverted indices. We devise 
efficient techniques to select the substrings and prove that 
our method can minimize the number of selected substrings. 
We develop novel pruning techniques to efficiently verify the 
candidate pairs. Experimental results show that our algo- 
rithms are efficient for both short strings and long strings, 
and outperform state-of-the-art methods on real datasets. 



1. INTRODUCTION 

A string similarity join between two sets of strings finds all 
similar string pairs from the two sets. For example, consider 
two sets of strings {vldb, sigmod, . . . } and {pvldb, icde, . . . }. 
We want to find all similar pairs, e.g., (vldb, pvldb). Many 
similarity functions have been proposed to quantify the similarity
between two strings, such as Jaccard similarity, cosine
similarity, and edit distance. In this paper, we study string
similarity joins with edit-distance constraints, which, given 
two sets of strings, find all similar string pairs from the two 
sets, such that the edit distance between each string pair 
is within a given threshold. The string similarity join is an 
essential operation in many applications, such as data in- 
tegration and cleaning, near duplicate object detection and 
elimination, and collaborative filtering. 
Existing methods to address this problem can be broadly
classified into two categories. The first one uses a filter-and-refine
framework, such as Part-Enum [2], All-Pairs-Ed [3], and
ED-Join [23]. In the filter step, they generate signatures
for each string and use the signatures to generate candidate 
pairs. In the refine step, they verify the candidate pairs 
to generate the final results. However, these approaches 
are inefficient for the datasets with short strings (e.g., per- 
son names and locations) [20]. The main reason is that 
they cannot select high-quality signatures for short strings 
and will generate large numbers of candidates which need 
to be further verified. The second method (Trie-Join [20])
adopts a trie-based framework, which uses a trie structure to
share prefixes and utilizes prefix pruning to improve the performance.
However, Trie-Join is inefficient for long strings
(e.g., paper titles and abstracts). This is because long strings
have a small number of shared prefixes.

If a system needs to support both short strings and long
strings, its developers have to implement and maintain two separate
code bases, and tune many parameters to select the best method.
This calls for an adaptive method
that can efficiently support both short strings and long
strings. In this paper we propose a partition-based method
to address this problem. We devise a partition scheme to 
partition a string into a set of segments and prove that if a 
string s is similar to string r, s must have a substring which 
matches a segment of r. Based on this observation, we pro- 
pose a partition-based framework for string similarity joins, 
called Pass-Join. Pass-Join creates inverted indices for the 
segments. For each string s, we select some of its substrings, 
and search for the selected substrings in the inverted indices. 
If a selected substring appears in the inverted index, each 
string r on the inverted list of this substring (i.e., r contains 
the substring) may be similar to s, and we take r and s as 
a candidate pair. Next we verify the pair to generate the 
final answers. We develop effective techniques to select sub- 
strings and prove that our method can minimize the number 
of selected substrings. We devise novel pruning techniques 
to verify candidate pairs. To summarize, we make the fol- 
lowing contributions. 

(1) We devise a partition scheme to partition strings into 
a set of segments. Using the partition scheme, we propose 
a partition-based framework to facilitate similarity joins. 

(2) We develop novel techniques to select substrings and 
use them to generate candidate pairs. We prove that our 
method can minimize the number of selected substrings. 

(3) We propose an extension-based method to efficiently 
verify a candidate pair, and develop pruning techniques and 
early-termination techniques to improve the performance.

(4) We have conducted an extensive set of experiments. 
Experimental results show that our algorithms are very effi- 
cient for both short strings and long strings, and outperform 
state-of-the-art methods on real datasets. 

The rest of this paper is organized as follows. We for- 
malize our problem in Section 2. Section 3 introduces our 
partition-based framework. We propose to effectively select 
substrings in Section 4 and develop novel techniques to effi- 
ciently verify candidates in Section 5. Experimental results 
are provided in Section 6. We review related work in Section 7
and conclude in Section 8.

2. PROBLEM FORMULATION 

Given two collections of strings, a similarity join finds 
all similar string pairs from the two collections. In this 
paper, we use edit distance to quantify the similarity be- 
tween two strings. Formally, the edit distance between two 
strings r and s, denoted by ED(r, s), is the minimum number 
of single-character edit operations (i.e., insertion, deletion, 
and substitution) needed to transform r to s. For example, 
ED("kaushic chaduri", "kaushuk chadhui") = 4. 
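The definition above can be illustrated with the textbook dynamic-programming (Levenshtein) recurrence. The following is a minimal sketch, not the verification method of this paper; the function name `edit_distance` is our own choice for illustration.

```python
def edit_distance(r: str, s: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to transform r into s (textbook DP, two rows)."""
    # prev[j] holds the distance between the current prefix of r and s[:j].
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, start=1):
        curr = [i]  # transforming r[:i] into the empty string needs i deletions
        for j, cs in enumerate(s, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cr
                curr[j - 1] + 1,           # insert cs
                prev[j - 1] + (cr != cs),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kaushic chaduri", "kaushuk chadhui"))
```

This baseline takes time proportional to the product of the two string lengths, which is why cheaper verification is worthwhile when many candidate pairs must be checked.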

In this paper two strings are similar if their edit distance
is not larger than a specified edit-distance threshold $\tau$. We
formalize the problem of string similarity joins as follows.

Definition 1 (String Similarity Joins). Given two
sets of strings $\mathcal{R}$ and $\mathcal{S}$ and an edit-distance threshold $\tau$, a
similarity join finds all similar string pairs $(r, s) \in \mathcal{R} \times \mathcal{S}$
such that $\mathrm{ED}(r, s) \le \tau$.

Without loss of generality, we focus on self join in this
paper, that is $\mathcal{R} = \mathcal{S}$. We will discuss how to join two
distinct sets ($\mathcal{R} \ne \mathcal{S}$) in Section 3.

For example, consider the strings in Table 1(a). Suppose the
threshold $\tau = 3$. ("kaushik chakrab", "caushik chakrabar")
is a similar pair as their edit distance is not larger than $\tau$.



Table 1: A set of strings

(a) Strings

Strings
-----------------
avataresha
caushik chakrabar
kaushic chaduri
kaushik chakrab
kaushuk chadhui
vankatesh

(b) Sorted strings

ID   Strings            Length
---  -----------------  ------
s1   vankatesh          9
s2   avataresha         10
s3   kaushic chaduri    15
s4   kaushik chakrab    15
s5   kaushuk chadhui    15
s6   caushik chakrabar  17


3. PARTITION-BASED SIMILARITY JOINS 

We first introduce a partition scheme to partition a string 
into several disjoint segments (Section 3.1), and then pro- 
pose a partition-based framework (Section 3.2). 

3.1 Partition Scheme 

Given a string s, we partition it into $\tau + 1$ disjoint segments,
and the length of each segment is not smaller than
one*. For example, consider string $s_1$ = "vankatesh". Suppose
$\tau = 3$. We have multiple ways to partition $s_1$ into
$\tau + 1 = 4$ segments, such as {"va", "nk", "at", "esh"}.

* The length of string s, $|s|$, should be larger than $\tau$, i.e., $|s| \ge \tau + 1$.

Consider two strings r and s. If s has no substring that
matches a segment of r, s cannot be similar to r based on
the pigeonhole principle as stated in Lemma 1. Due to space
constraints, we refer readers to our technical report [16]. In
other words, if s is similar to r, s must contain a substring
matching a segment of r. For example, consider the strings in
Table 1. Suppose $\tau = 3$. $s_1$ = "vankatesh" has four segments
{"va", "nk", "at", "esh"}. As $s_3, s_4, s_5, s_6$ have no substrings
matching segments of $s_1$, they are not similar to $s_1$.

Lemma 1. Given a string r with $\tau + 1$ segments and a
string s, if s is similar to r within threshold $\tau$, s must contain
a substring which matches a segment of r.
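The pruning rule implied by Lemma 1 can be sketched as a simple containment test. This is an illustrative sketch only; the helper name `has_matching_segment` is hypothetical, and it takes the segments of r as given rather than computing a partition.

```python
def has_matching_segment(segments_of_r: list[str], s: str) -> bool:
    """Necessary condition from Lemma 1: if this returns False, s cannot be
    within the threshold of r (assuming r was cut into tau + 1 segments)."""
    return any(segment in s for segment in segments_of_r)


if __name__ == "__main__":
    segments = ["va", "nk", "at", "esh"]  # partition of "vankatesh" for tau = 3
    for s in ["avataresha", "kaushic chaduri", "caushik chakrabar"]:
        print(s, has_matching_segment(segments, s))
```

A `True` result only makes the pair a candidate; it still has to be verified, since the condition is necessary but not sufficient.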

Given a string, there could be many strategies to partition
the string into $\tau + 1$ segments. A good partition strategy can
reduce the number of candidate pairs and thus improve the
performance. Intuitively, the shorter a segment of r is, the
higher the probability that the segment appears in other strings, and
the more strings will be taken as r's candidates, thus the
pruning power is lower. Based on this observation, we do
not want to keep short segments in the partition. In other
words, each segment should have nearly the same length.
Accordingly we propose an even-partition scheme. Consider
a string s with length $|s|$. In even partition, each segment
has a length of $\lfloor |s|/(\tau+1) \rfloor$ or $\lceil |s|/(\tau+1) \rceil$; thus the maximal length
difference between two segments is 1. Let $k = |s| - \lfloor |s|/(\tau+1) \rfloor \cdot (\tau + 1)$.
In even partition, the last k segments have length
$\lceil |s|/(\tau+1) \rceil$ and the first $\tau + 1 - k$ ones have length
$\lfloor |s|/(\tau+1) \rfloor$. For
example, consider $s_1$ = "vankatesh" and suppose $\tau = 3$. We
have $k = 1$. $s_1$ has four segments {"va", "nk", "at", "esh"}.
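The even-partition scheme can be sketched in a few lines. This is a minimal illustration under the definitions above; the function name `even_partition` and the choice to return 1-based start positions alongside the segments are our own.

```python
def even_partition(s: str, tau: int) -> list[tuple[int, str]]:
    """Split s into tau + 1 segments whose lengths differ by at most one.

    Returns (start_position, segment) pairs with 1-based start positions.
    The first tau + 1 - k segments have the floor length and the last k
    segments have the ceiling length, where k = |s| mod (tau + 1).
    """
    n_segments = tau + 1
    if len(s) < n_segments:
        raise ValueError("the string must have at least tau + 1 characters")
    base, k = divmod(len(s), n_segments)
    result = []
    pos = 0  # 0-based offset of the next segment
    for i in range(n_segments):
        length = base if i < n_segments - k else base + 1
        result.append((pos + 1, s[pos:pos + length]))
        pos += length
    return result


if __name__ == "__main__":
    print(even_partition("vankatesh", 3))
```

For "vankatesh" and a threshold of 3 this yields the segments "va", "nk", "at", "esh" at start positions 1, 3, 5, and 7.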

Although we can devise other partition schemes, it is time
consuming to select a good partition strategy. Note that the
time for selecting a partition strategy should be included in
the similarity join time. In this paper we use the even-partition
scheme and leave the problem of selecting good
partition strategies as future work. Note that our proposed
techniques can be extended to other partition strategies.

3.2 Partition-based Framework 

We observe that if a string s does not have
a substring that matches a segment of r, we can prune the
pair (s, r). We can use this feature to prune large numbers
of dissimilar pairs. To this end, we propose a partition-based
framework for string similarity joins, called Pass-Join.
Figure 2 illustrates our framework.

For ease of presentation, we first introduce some notations.
Let $\mathcal{S}_l$ denote the set of strings with length l and $\mathcal{S}_l^i$
denote the set of the i-th segments of strings in $\mathcal{S}_l$. We build
an inverted index for each $\mathcal{S}_l^i$, denoted by $\mathcal{L}_l^i$. Given an i-th
segment w, let $\mathcal{L}_l^i(w)$ denote the inverted list of segment w,
i.e., the set of strings whose i-th segments are w. Pass-Join
uses the inverted indices to do similarity joins as follows.

Pass-Join first sorts strings based on their lengths in ascending
order. For the strings with the same length, it sorts
them in alphabetical order. Then Pass-Join visits strings in
order. Consider the current string s with length $|s|$. Pass-Join
finds s's similar strings among the visited strings using
the inverted indices. To efficiently find such strings, we create
indices only for visited strings to avoid enumerating a
string pair twice. Based on length filtering [7], we check
whether the strings in $\mathcal{L}_l^i$ ($|s| - \tau \le l \le |s|$, $1 \le i \le \tau + 1$)
are similar to s. Without loss of generality, consider inverted
index $\mathcal{L}_l^i$. Pass-Join finds s's similar strings in $\mathcal{L}_l^i$ as follows.
• Substring Selection: If s is similar to a string in $\mathcal{L}_l^i$,
s should contain a substring which matches a segment
in $\mathcal{L}_l^i$. A straightforward method enumerates all of
s's substrings, and for each substring checks whether
it appears in $\mathcal{L}_l^i$. Actually we do not need to consider
all substrings of s. Instead we only select some
substrings (denoted by $\mathcal{W}(s, \mathcal{L}_l^i)$) and use the selected
substrings to find similar pairs. We discuss how to
generate $\mathcal{W}(s, \mathcal{L}_l^i)$ in Section 4. For each selected substring
$w \in \mathcal{W}(s, \mathcal{L}_l^i)$, we check whether it appears in
$\mathcal{L}_l^i$. If so, for each $r \in \mathcal{L}_l^i(w)$, (r, s) is a candidate pair.
• Verification: To verify whether a candidate pair
(r, s) is an answer, a straightforward method computes
their real edit distance. However this method is rather
expensive, and we develop effective techniques to do
efficient verification in Section 5.

After finding similar strings for s, we partition s into
$\tau + 1$ segments and insert the segments into the inverted indices
$\mathcal{L}_{|s|}^i$ ($1 \le i \le \tau + 1$). Then we visit the strings after s and iteratively
find all similar pairs. Note that we can remove the
inverted index $\mathcal{L}_k^i$ for $k < |s| - \tau$. Thus we maintain at most
$(\tau + 1)^2$ inverted indices $\mathcal{L}_l^i$ for $|s| - \tau \le l \le |s|$ and $1 \le i \le \tau + 1$.

[Figure 1: An example of our partition-based framework]

[Figure 2: Partition-based framework]

To join two distinct sets $\mathcal{R}$ and $\mathcal{S}$, we first sort the strings
in the two sets respectively. Then we index the segments of
strings in one set, e.g., $\mathcal{S}$. Next we visit strings of $\mathcal{R}$ in order.
For each string $r \in \mathcal{R}$ with length $|r|$, we use the inverted
indices of strings in $\mathcal{S}$ with lengths in $[|r| - \tau, |r| + \tau]$
to find similar pairs. We can remove the indices for strings
with lengths smaller than $|r| - \tau$. In this paper we focus on
the case where the index fits in memory. We leave
dealing with very large datasets as future work.

For example, consider the strings in Table 1. Suppose $\tau = 3$.
We find similar pairs as follows (Figure 1). For the first
string $s_1$ = "vankatesh", we partition it into $\tau + 1$ segments
and insert the segments into the inverted indices for
strings with length 9, i.e., $\mathcal{L}_9^1$, $\mathcal{L}_9^2$, $\mathcal{L}_9^3$, and $\mathcal{L}_9^4$. Next for
$s_2$ = "avataresha", we enumerate its substrings and check
whether each substring appears in $\mathcal{L}_{|s_2|-3}^i, \ldots, \mathcal{L}_{|s_2|}^i$ ($1 \le i \le \tau + 1$).
Here we find "va" in $\mathcal{L}_9^1$, "at" in $\mathcal{L}_9^3$, and "esh" in $\mathcal{L}_9^4$.
For segment "va", as $\mathcal{L}_9^1$("va") = {$s_1$}, the pair ($s_2$, $s_1$) is a
candidate pair. We verify the pair and it is not an answer as
the edit distance is larger than $\tau$. Next we partition $s_2$ into
four segments and insert them into $\mathcal{L}_{|s_2|}^1$, $\mathcal{L}_{|s_2|}^2$, $\mathcal{L}_{|s_2|}^3$, $\mathcal{L}_{|s_2|}^4$.
Similarly we repeat the above steps and find all similar pairs.

Algorithm 1: Pass-Join($\mathcal{S}$, $\tau$)

Input: $\mathcal{S}$: A collection of strings
       $\tau$: A given edit-distance threshold
Output: $\mathcal{A} = \{(s \in \mathcal{S}, r \in \mathcal{S}) \mid \mathrm{ED}(s, r) \le \tau\}$

1 begin
2   Sort $\mathcal{S}$ first by string length and second in alphabetical order;
3   for $s \in \mathcal{S}$ do
4     for $\mathcal{L}_l^i$ ($|s| - \tau \le l \le |s|$, $1 \le i \le \tau + 1$) do
5       $\mathcal{W}(s, \mathcal{L}_l^i)$ = SubstringSelection(s, $\mathcal{L}_l^i$);
6       for $w \in \mathcal{W}(s, \mathcal{L}_l^i)$ do
7         if w is in $\mathcal{L}_l^i$ then Verification(s, $\mathcal{L}_l^i(w)$, $\tau$);
8     Partition s and add its segments into $\mathcal{L}_{|s|}^i$;
9 end

Function SubstringSelection(s, $\mathcal{L}_l^i$)

Input: s: A string; $\mathcal{L}_l^i$: Inverted index
Output: $\mathcal{W}(s, \mathcal{L}_l^i)$: Selected substrings

1 begin
2   $\mathcal{W}(s, \mathcal{L}_l^i)$ = {w | w is a substring of s};
3 end

Function Verification(s, $\mathcal{L}_l^i(w)$, $\tau$)

Input: s: A string; $\mathcal{L}_l^i(w)$: Inverted list; $\tau$: Threshold
Output: $\mathcal{A} = \{(s \in \mathcal{S}, r \in \mathcal{S}) \mid \mathrm{ED}(s, r) \le \tau\}$

1 begin
2   for $r \in \mathcal{L}_l^i(w)$ do
3     if $\mathrm{ED}(s, r) \le \tau$ then $\mathcal{A} \leftarrow (s, r)$;
4 end

Figure 3: Pass-Join algorithm

We give the pseudo-code of our algorithm in Figure 3.
Pass-Join sorts strings first by length and then in alphabetical
order (line 2). Then, Pass-Join visits each string in
sorted order (line 3). For each inverted index $\mathcal{L}_l^i$
($|s| - \tau \le l \le |s|$, $1 \le i \le \tau + 1$) (line 4), Pass-Join selects the substrings of
s (line 5) and checks whether each selected substring w is in
$\mathcal{L}_l^i$ (lines 6-7). If yes, for any string r in the inverted list of w
in $\mathcal{L}_l^i$, i.e., $\mathcal{L}_l^i(w)$, the string pair (r, s) is a candidate pair.
Pass-Join verifies the pair (line 7). Finally, Pass-Join partitions
s into $\tau + 1$ segments, and inserts the segments into
the inverted indices $\mathcal{L}_{|s|}^i$ ($1 \le i \le \tau + 1$) (line 8). Here function
SubstringSelection selects all substrings and function
Verification computes the real edit distance of two
strings to verify the candidates using a dynamic-programming
algorithm. To improve the performance, we propose effective
techniques to improve the substring-selection step in
Section 4 and the verification step in Section 5.
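The basic framework described above can be sketched as follows. This is a simplified illustration and not the authors' implementation: the names `pass_join`, `segment_lengths`, and `edit_distance` are our own, the index is a single dictionary keyed by (length, segment number, segment text), only substrings whose length equals the segment length are probed (others cannot match in that index), and old indices are never removed. It assumes every string has at least $\tau + 1$ characters.

```python
from collections import defaultdict


def edit_distance(r: str, s: str) -> int:
    """Textbook dynamic-programming edit distance (baseline verification)."""
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, start=1):
        curr = [i]
        for j, cs in enumerate(s, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cr != cs)))
        prev = curr
    return prev[-1]


def segment_lengths(l: int, tau: int) -> list[int]:
    """Segment lengths of the even partition of a string of length l."""
    base, k = divmod(l, tau + 1)
    return [base] * (tau + 1 - k) + [base + 1] * k


def pass_join(strings: list[str], tau: int) -> set[tuple[str, str]]:
    """Self join: return pairs (r, s) with edit distance at most tau."""
    if any(len(x) < tau + 1 for x in strings):
        raise ValueError("every string needs at least tau + 1 characters")
    ordered = sorted(strings, key=lambda x: (len(x), x))
    index = defaultdict(list)  # (l, i, segment) -> ids of visited strings
    answers = set()
    for sid, s in enumerate(ordered):
        candidates = set()
        # Length filtering: only indices for lengths in [|s| - tau, |s|].
        for l in range(max(tau + 1, len(s) - tau), len(s) + 1):
            for i, seg_len in enumerate(segment_lengths(l, tau)):
                for start in range(len(s) - seg_len + 1):
                    key = (l, i, s[start:start + seg_len])
                    if key in index:
                        candidates.update(index[key])
        # Verification of each distinct candidate.
        for rid in candidates:
            if edit_distance(ordered[rid], s) <= tau:
                answers.add((ordered[rid], s))
        # Index the segments of s so that later strings can find it.
        pos = 0
        for i, seg_len in enumerate(segment_lengths(len(s), tau)):
            index[(len(s), i, s[pos:pos + seg_len])].append(sid)
            pos += seg_len
    return answers


if __name__ == "__main__":
    table1 = ["avataresha", "caushik chakrabar", "kaushic chaduri",
              "kaushik chakrab", "kaushuk chadhui", "vankatesh"]
    print(pass_join(table1, 3))
```

On the strings of Table 1 with a threshold of 3, the sketch reports the single pair ("kaushik chakrab", "caushik chakrabar").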

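Complexity: We first analyze the space complexity. Our
indexing structure includes segments and inverted lists of
segments. We first give the space complexity of segments.
For each string in $\mathcal{S}_l$ we generate $\tau + 1$ segments. Thus the
number of segments is at most $(\tau + 1) \times |\mathcal{S}_l|$, where $|\mathcal{S}_l|$ is the
number of strings in $\mathcal{S}_l$. As we can use an integer to encode a
segment, the space complexity of segments is

$$O\Big(\max_{l_{min} \le j \le l_{max}} \sum_{l=j-\tau}^{j} (\tau + 1) \times |\mathcal{S}_l|\Big),$$

where $l_{min}$ and $l_{max}$ respectively denote the minimal string
length and the maximal string length.

Next we give the complexity of inverted lists. For each
string in $\mathcal{S}_l$, as the i-th segment of the string corresponds
to an element in $\mathcal{L}_l^i$, $|\mathcal{S}_l| = |\mathcal{L}_l^i|$. The space complexity of
inverted lists (i.e., the sum of the lengths of inverted lists) is

$$O\Big(\max_{l_{min} \le j \le l_{max}} \sum_{l=j-\tau}^{j} \sum_{i=1}^{\tau+1} |\mathcal{L}_l^i|\Big) = O\Big(\max_{l_{min} \le j \le l_{max}} (\tau + 1) \sum_{l=j-\tau}^{j} |\mathcal{S}_l|\Big).$$

Then we give the time complexity. To sort the strings,
we can first group the strings based on lengths and then
sort strings in each group. Thus the sort complexity is
$O\big(\sum_{l_{min} \le l \le l_{max}} |\mathcal{S}_l| \log(|\mathcal{S}_l|)\big)$. For each string s, we select
its substring set $\mathcal{W}(s, \mathcal{L}_l^i)$ for $|s| - \tau \le l \le |s|$, $1 \le i \le \tau + 1$.
The selection complexity is

$$O\Big(\sum_{s \in \mathcal{S}} \sum_{l=|s|-\tau}^{|s|} \sum_{i=1}^{\tau+1} \mathcal{X}(s, \mathcal{L}_l^i)\Big),$$

where $\mathcal{X}(s, \mathcal{L}_l^i)$ is the selection time complexity for $\mathcal{W}(s, \mathcal{L}_l^i)$,
which is $O(\tau)$ (Section 4). The selection complexity is $O(\tau^3 |\mathcal{S}|)$.
For each substring $w \in \mathcal{W}(s, \mathcal{L}_l^i)$, we verify whether strings
in $\mathcal{L}_l^i(w)$ are similar to s. The verification complexity is

$$O\Big(\sum_{s \in \mathcal{S}} \sum_{l=|s|-\tau}^{|s|} \sum_{i=1}^{\tau+1} \sum_{w \in \mathcal{W}(s, \mathcal{L}_l^i)} \sum_{r \in \mathcal{L}_l^i(w)} \mathcal{V}(s, r)\Big),$$

where $\mathcal{V}(s, r)$ is the complexity for verifying (s, r), which is
$O(\tau \cdot \min(|s|, |r|))$ (Section 5). In the paper we propose to reduce
the size of $\mathcal{W}(s, \mathcal{L}_l^i)$ and improve the verification cost $\mathcal{V}(s, r)$.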

4. IMPROVING SUBSTRING SELECTION 

For any string $s \in \mathcal{S}$ and a length l ($|s| - \tau \le l \le |s|$), we
select a substring set $\mathcal{W}(s, l) = \cup_{i=1}^{\tau+1} \mathcal{W}(s, \mathcal{L}_l^i)$ of s and use
substrings in $\mathcal{W}(s, l)$ to find the candidates of s. We need to
guarantee completeness of the method using $\mathcal{W}(s, l)$ to find
candidate pairs. That is, any similar pair must be found as
a candidate pair. Next we give the formal definition.

Definition 2 (Completeness). A substring selection
method satisfies completeness, if for any string s and a length
l ($|s| - \tau \le l \le |s|$), $\forall r$ with length l which is similar to s
and visited before s, r must have an i-th segment $r_m$ which
matches a substring $s_m \in \mathcal{W}(s, \mathcal{L}_l^i)$ where $1 \le i \le \tau + 1$.

A straightforward method is to add all substrings of s into
$\mathcal{W}(s, l)$. As s has $|s| - i + 1$ substrings with length i, the
total number of s's substrings is
$\sum_{i=1}^{|s|} (|s| - i + 1) = \frac{|s| (|s| + 1)}{2}$.
For long strings, there are large numbers of substrings and
it is rather expensive to enumerate all substrings.

Intuitively, the smaller the size of $\mathcal{W}(s, l)$, the higher the
performance. Thus we want to find substring sets with smaller
sizes. In this section, we propose several methods to select
the substring set $\mathcal{W}(s, l)$. As $\mathcal{W}(s, l) = \cup_{i=1}^{\tau+1} \mathcal{W}(s, \mathcal{L}_l^i)$ and
we want to use index $\mathcal{L}_l^i$ to do efficient filtering, next we
focus on how to generate $\mathcal{W}(s, \mathcal{L}_l^i)$ for $\mathcal{L}_l^i$.

Length-based Method: As segments in $\mathcal{L}_l^i$ have the same
length, denoted by $l_i$, the length-based method selects all
substrings of s with length $l_i$, denoted by $\mathcal{W}_\ell(s, \mathcal{L}_l^i)$. Let
$\mathcal{W}_\ell(s, l) = \cup_{i=1}^{\tau+1} \mathcal{W}_\ell(s, \mathcal{L}_l^i)$. The length-based method
satisfies completeness, as it selects all substrings with length
$l_i$. The size of $\mathcal{W}_\ell(s, \mathcal{L}_l^i)$ is $|\mathcal{W}_\ell(s, \mathcal{L}_l^i)| = |s| - l_i + 1$, and the
number of selected substrings is $|\mathcal{W}_\ell(s, l)| = (\tau + 1)(|s| + 1) - l$.

Shift-based Method: However the length-based method
does not consider the positions of segments. To address this
problem, Wang et al. [22] proposed a shift-based method to
address the entity identification problem. We can extend
their method to support our problem as follows. As segments
in $\mathcal{L}_l^i$ have the same length, they have the same start
position, denoted by $p_i$, where $p_1 = 1$ and $p_i = p_1 + \sum_{k=1}^{i-1} l_k$
for $i > 1$. The shift-based method selects s's substrings with
start positions in $[p_i - \tau, p_i + \tau]$ and with length $l_i$, denoted
by $\mathcal{W}_f(s, \mathcal{L}_l^i)$. Let $\mathcal{W}_f(s, l) = \cup_{i=1}^{\tau+1} \mathcal{W}_f(s, \mathcal{L}_l^i)$. The size of
$\mathcal{W}_f(s, \mathcal{L}_l^i)$ is $|\mathcal{W}_f(s, \mathcal{L}_l^i)| = 2\tau + 1$. The number of selected
substrings is $|\mathcal{W}_f(s, l)| = (\tau + 1)(2\tau + 1)$.

The basic idea behind the method is as follows. Suppose
a substring $s_m$ of s with start position smaller than $p_i - \tau$
or larger than $p_i + \tau$ matches a segment in $\mathcal{L}_l^i$. Consider a
string $r \in \mathcal{L}_l^i(s_m)$. We can partition s (r) into three parts:
the matching part $s_m$ ($r_m$), the left part before the matching
part $s_l$ ($r_l$), and the right part after the matching part $s_r$ ($r_r$).
As the start position of $r_m$ is $p_i$ and the start position of $s_m$
is smaller than $p_i - \tau$ or larger than $p_i + \tau$, the length
difference between $s_l$ and $r_l$ must be larger than $\tau$. If we align
the two strings by matching $s_m$ and $r_m$ (i.e., transforming
$r_l$ to $s_l$, matching $r_m$ with $s_m$, and transforming $r_r$ to $s_r$),
they will not be similar, thus we can prune substring $s_m$.
Hence the shift-based method satisfies completeness.

However, the shift-based method still involves many unnecessary
substrings. For example, consider two strings $s_1$
= "vankatesh" and $s_2$ = "avataresha". Suppose $\tau = 3$
and "vankatesh" is partitioned into four segments {va, nk,
at, esh}. As $s_2$ = "avataresha" contains a substring "at"
which matches the third segment in "vankatesh", the shift-based
method will select it as a substring. However we can
prune it and the reason is as follows. Suppose we partition
the two strings into three parts based on the matching
segment. For instance, we partition "vankatesh" into
{"vank", "at", "esh"}, and "avataresha" into {"av", "at",
"aresha"}. Obviously the minimal edit distance (length
difference) between the left parts ("vank" and "av") is 2 and
the minimal edit distance (length difference) between the
right parts ("esh" and "aresha") is 3. Thus if we align the
two strings using the matching segment "at", they will not
be similar. In this way, we can prune the substring "at".

4.1 Position-aware Substring Selection 

Notice that all the segments in $\mathcal{L}_l^i$ have the same length $l_i$
and the same start position $p_i$. Without loss of generality,
we consider a segment $r_m \in \mathcal{L}_l^i$. Moreover, all the strings in
inverted list $\mathcal{L}_l^i(r_m)$ have the same length l ($l \le |s|$), and we
consider a string r that contains segment $r_m$. Suppose s has
a substring $s_m$ which matches $r_m$. Next we give the possible
start positions of $s_m$. We still partition s (r) into three parts:
the matching part $s_m$ ($r_m$), the left part $s_l$ ($r_l$), and the right
part $s_r$ ($r_r$). If we align r and s by matching $r_m = s_m$, that
is we transform r to s by first transforming $r_l$ to $s_l$ with
$d_l = \mathrm{ED}(r_l, s_l)$ edit operations, then matching $r_m$ with $s_m$,
and finally transforming $r_r$ to $s_r$ with $d_r = \mathrm{ED}(r_r, s_r)$ edit
operations, the total transformation distance is $d_l + d_r$. If
s is similar to r, $d_l + d_r \le \tau$. Based on this observation,
we give $s_m$'s minimal start position ($p_{min}$) and the maximal
start position ($p_{max}$) as illustrated in Figure 4.

[Figure 4: Position-based substring selection. (a) Minimal position $p_{min} = \max(1, p_i - \lfloor (\tau - \Delta)/2 \rfloor)$; (b) Maximal position $p_{max} = \min(|s| - l_i + 1, p_i + \lfloor (\tau + \Delta)/2 \rfloor)$]

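Minimal Start Position: Suppose the start position of
$s_m$, denoted by p, is not larger than $p_i$. Let $\Delta = |s| - |r|$
and $\Delta_l = p_i - p$. We have $d_l = \mathrm{ED}(r_l, s_l) \ge \Delta_l$ and $d_r =
\mathrm{ED}(r_r, s_r) \ge \Delta_l + \Delta$, as illustrated in Figure 4(a). If s is
similar to r (or any string in $\mathcal{L}_l^i(r_m)$), we have

$$\Delta_l + (\Delta_l + \Delta) \le d_l + d_r \le \tau.$$

That is $\Delta_l \le \lfloor (\tau - \Delta)/2 \rfloor$ and $p = p_i - \Delta_l \ge p_i - \lfloor (\tau - \Delta)/2 \rfloor$. Thus
$p_{min} \ge p_i - \lfloor (\tau - \Delta)/2 \rfloor$. As $p_{min} \ge 1$, $p_{min} = \max(1, p_i - \lfloor (\tau - \Delta)/2 \rfloor)$.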

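Maximal Start Position: Suppose the start position of
$s_m$, p, is larger than $p_i$. Let $\Delta = |s| - |r|$ and $\Delta_r = p - p_i$. We
have $d_l = \mathrm{ED}(r_l, s_l) \ge \Delta_r$ and $d_r = \mathrm{ED}(r_r, s_r) \ge |\Delta_r - \Delta|$ as
illustrated in Figure 4(b). If $\Delta_r \le \Delta$, $d_r \ge \Delta - \Delta_r$. Thus
$\Delta = \Delta_r + (\Delta - \Delta_r) \le d_l + d_r \le \tau$, and in this case, the
maximal value of $\Delta_r$ is $\Delta$; otherwise if $\Delta_r > \Delta$, $d_r \ge \Delta_r - \Delta$.
If s is similar to r (or any string in $\mathcal{L}_l^i(r_m)$), we have

$$\Delta_r + (\Delta_r - \Delta) \le d_l + d_r \le \tau.$$

That is $\Delta_r \le \lfloor (\tau + \Delta)/2 \rfloor$, and $p = p_i + \Delta_r \le p_i + \lfloor (\tau + \Delta)/2 \rfloor$. Thus
$p_{max} \le p_i + \lfloor (\tau + \Delta)/2 \rfloor$. As the segment length is $l_i$, based on the
boundary, we have $p_{max} \le |s| - l_i + 1$. Thus
$p_{max} = \min(|s| - l_i + 1, p_i + \lfloor (\tau + \Delta)/2 \rfloor)$.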

For example, consider string r = "vankatesh". Suppose
$\tau = 3$ and "vankatesh" is partitioned into four segments,
{va, nk, at, esh}. For string s = "avataresha", we have
$\Delta = |s| - |r| = 1$, $\Delta_l \le \lfloor (\tau - \Delta)/2 \rfloor = 1$, and $\Delta_r \le \lfloor (\tau + \Delta)/2 \rfloor = 2$.
For the first segment "va", $p_1 = 1$, $p_{min} = \max(1, p_1 - \lfloor (\tau - \Delta)/2 \rfloor) = 1$
and $p_{max} = 1 + \lfloor (\tau + \Delta)/2 \rfloor = 3$. Thus we only
need to enumerate the following substrings "av", "va", "at"
for the first segment. Similarly, we need to enumerate substrings
"va", "at", "ta", "ar" for the second segment, "ta",
"ar", "re", "es" for the third segment, and "res", "esh",
"sha" for the fourth segment. We see that the position-based
method can reduce many unnecessary substrings over
the shift-based method (reducing the number from 28 to 14).

[Figure 5: Multi-match-aware substring selection. (a) Multi-match from the left-side perspective: $\bot_i^l = \max(1, p_i - (i - 1))$, $\top_i^l = \min(|s| - l_i + 1, p_i + (i - 1))$, with $\Delta_l \le i - 1$ as there are $i - 1$ segments in $r_l$; (b) Multi-match from the right-side perspective: $\bot_i^r = \max(1, p_i + \Delta - (\tau + 1 - i))$, $\top_i^r = \min(|s| - l_i + 1, p_i + \Delta + (\tau + 1 - i))$, with $\Delta_r \le \tau + 1 - i$ as there are $\tau + 1 - i$ segments in $r_r$]

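For $\mathcal{L}_l^i$, the position-aware method selects substrings with
start positions in $[p_{min}, p_{max}]$ and with length $l_i$, denoted
by $\mathcal{W}_p(s, \mathcal{L}_l^i)$. Let $\mathcal{W}_p(s, l) = \cup_{i=1}^{\tau+1} \mathcal{W}_p(s, \mathcal{L}_l^i)$. The size of
$\mathcal{W}_p(s, \mathcal{L}_l^i)$ is $|\mathcal{W}_p(s, \mathcal{L}_l^i)| = \tau + 1$ and the number of selected
substrings is $|\mathcal{W}_p(s, l)| = (\tau + 1)^2$. The position-aware method
satisfies completeness as formalized in Theorem 1.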
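The position-aware bounds can be sketched as below. This is an illustrative sketch under the even-partition scheme; the function name `position_aware_selection` is hypothetical, positions are 1-based as in the text, and the result lists the selected substrings per segment. Because of the floor operations and the string boundaries, the number of selected substrings for a concrete string can be smaller than $(\tau + 1)^2$.

```python
def position_aware_selection(s: str, l: int, tau: int) -> list[list[str]]:
    """Substrings of s selected for the indices of strings of length l.

    For segment i with start position p_i and length l_i, keep start
    positions in [max(1, p_i - floor((tau - delta) / 2)),
                  min(|s| - l_i + 1, p_i + floor((tau + delta) / 2))],
    where delta = |s| - l.
    """
    delta = len(s) - l
    if l < tau + 1 or not 0 <= delta <= tau:
        raise ValueError("expected tau + 1 <= l and |s| - tau <= l <= |s|")
    base, k = divmod(l, tau + 1)
    seg_lens = [base] * (tau + 1 - k) + [base + 1] * k  # even partition
    left_shift = (tau - delta) // 2
    right_shift = (tau + delta) // 2
    selected = []
    p_i = 1  # 1-based start position of the current segment
    for l_i in seg_lens:
        p_min = max(1, p_i - left_shift)
        p_max = min(len(s) - l_i + 1, p_i + right_shift)
        selected.append([s[p - 1:p - 1 + l_i] for p in range(p_min, p_max + 1)])
        p_i += l_i
    return selected


if __name__ == "__main__":
    for i, subs in enumerate(position_aware_selection("avataresha", 9, 3), start=1):
        print(i, subs)
```

For s = "avataresha", l = 9, and a threshold of 3, this reproduces the 14 substrings of the running example.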

Theorem 1. The position-aware substring selection method
satisfies completeness.

4.2 Multi-match-aware Substring Selection 

We observe that string s may have multiple
substrings that match some segments of string r. In this
case we can discard some of these substrings. For example,
consider r = "vankatesh" with four segments, {va, nk, at,
esh}. s = "avataresha" has three substrings va, at, esh
matching the segments of r. We can discard some of these
substrings. To this end, we propose a multi-match-aware
substring selection method.

Consider $\mathcal{L}_l^i$. Suppose string s has a substring $s_m$ that
matches a segment in $\mathcal{L}_l^i$. If we know that s must have a substring
after $s_m$ which will match a segment in $\mathcal{L}_l^j$ ($j > i$), we
can discard substring $s_m$. For example, s = "avataresha"
has a substring "va" matching a segment in r = "vankatesh".
Consider the three parts $r_m = s_m$ = "va", $r_l = \phi$ and $s_l$ =
"a", and $r_r$ = "nkatesh" and $s_r$ = "taresha". As $d_l \ge 1$, if
s and r are similar, $d_r \le \tau - d_l \le \tau - 1 = 2$. As there are still
3 segments in $r_r$, $s_r$ must have a substring matching a
segment in $r_r$ based on the pigeon-hole principle. Thus we
can discard the substring "va" and use the next substring
to find similar pairs. Next we generalize our idea.

Suppose s has a substring $s_m$ with start position p matching
a segment $r_m \in \mathcal{L}_l^i$. We still consider the three parts of
the two strings: $s_l, s_m, s_r$ and $r_l, r_m, r_r$ as illustrated in Figure 5.
Let $\Delta_l = |p_i - p|$. $d_l = \mathrm{ED}(r_l, s_l) \ge \Delta_l$. As there
are $i - 1$ segments in $r_l$, if each segment only has 1 error
when transforming $r_l$ to $s_l$, we have $\Delta_l \le i - 1$. If $\Delta_l \ge i$,
$d_l = \mathrm{ED}(r_l, s_l) \ge \Delta_l \ge i$, $d_r = \mathrm{ED}(r_r, s_r) \le \tau - d_l \le \tau - i$
(if s is similar to r). As $r_r$ contains $\tau + 1 - i$ segments, $s_r$
must contain a substring matching a segment in $r_r$ based
on the pigeon-hole principle, which can be proved similarly to
Lemma 1. In this way, we can discard $s_m$, since for any
string $r \in \mathcal{L}_l^i(r_m)$, s must have a substring that matches
a segment in the right part $r_r$, and thus we can identify
strings similar to s using the next matching segment. In
summary, if $\Delta_l = |p - p_i| \le i - 1$, we keep the substring
with start position p for $\mathcal{L}_l^i$. That is, the minimal start
position is $\bot_i^l = \max(1, p_i - (i - 1))$ and the maximal start
position is $\top_i^l = \min(|s| - l_i + 1, p_i + (i - 1))$.

For example, consider string r = "vankatesh" with four
segments, {va, nk, at, esh}, and string s = "avataresha".
For the first segment, we have $\bot_1^l = 1 - 0 = 1$ and $\top_1^l = 1 + 0 = 1$.
Thus the selected substring is only "av" for the first segment.
For the second segment, we have $\bot_2^l = 3 - 1 = 2$ and $\top_2^l = 3 + 1 = 4$.
Thus the selected substrings are "va", "at", and "ta" for
the second segment. Similarly for the third segment, we have
$\bot_3^l = 5 - 2 = 3$ and $\top_3^l = 5 + 2 = 7$, and for the fourth segment, we
have $\bot_4^l = 7 - 3 = 4$ and $\top_4^l = 7 + 3 = 10$.

The above observation is made from the left-side perspective.
Similarly, we can use the same idea from the right-side
perspective. As there are $\tau + 1 - i$ segments on the right part
$r_r$, there are at most $\tau + 1 - i$ edit errors on $r_r$. If we transform
r to s from the right-side perspective, position $p_i$ on r
should be aligned with position $p_i + \Delta$ on s as shown in Figure 5(b).
Suppose position p on s matches position $p_i$
on r. Let $\Delta_r = |p - (p_i + \Delta)|$. We have $d_r = \mathrm{ED}(s_r, r_r) \ge \Delta_r$.
As there are $\tau + 1 - i$ segments on the right part $r_r$, we have
$\Delta_r \le \tau + 1 - i$. Thus the minimal start position for $\mathcal{L}_l^i$ is
$\bot_i^r = \max(1, p_i + \Delta - (\tau + 1 - i))$ and the maximal start
position is $\top_i^r = \min(|s| - l_i + 1, p_i + \Delta + (\tau + 1 - i))$.

Consider the above example. Suppose $\tau = 3$ and $\Delta = 1$.
For the fourth segment, we have $\bot_4^r = 7 + 1 - (3 + 1 - 4) = 8$
and $\top_4^r = 7 + 1 + (3 + 1 - 4) = 8$. Thus the selected substring
is only "sha" for the fourth segment. Similarly for the third
segment, we have $\bot_3^r = 5$ and $\top_3^r = 7$. Thus the selected
substrings are "ar", "re", and "es" for the third segment.

More interestingly, we can use the two techniques simultaneously.
That is, for $\mathcal{L}_l^i$, we only select the substrings
with start positions between $\bot_i = \max(\bot_i^l, \bot_i^r)$ and
$\top_i = \min(\top_i^l, \top_i^r)$ and with length $l_i$, denoted by $\mathcal{W}_m(s, \mathcal{L}_l^i)$.
Let $\mathcal{W}_m(s, l) = \cup_{i=1}^{\tau+1} \mathcal{W}_m(s, \mathcal{L}_l^i)$. The number of selected
substrings is $|\mathcal{W}_m(s, l)| = \lfloor (\tau^2 - \Delta^2)/2 \rfloor + \tau + 1$ as stated in Lemma 2.

Lemma 2. $|\mathcal{W}_m(s, l)| = \lfloor (\tau^2 - \Delta^2)/2 \rfloor + \tau + 1$.

Moreover we prove that the multi-match-aware selection
method satisfies completeness as stated in Theorem 2.

Theorem 2. The multi-match-aware substring selection
method satisfies completeness.

Consider the above example. For the first segment, we
have $\bot_1 = 1 - 0 = 1$ and $\top_1 = 1 + 0 = 1$. We select
"av" for the first segment. For the second segment, we have
$\bot_2 = 3 - 1 = 2$ and $\top_2 = 3 + 1 = 4$. We select substrings
"va", "at", and "ta" for the second segment. For the third
segment, we have $\bot_3 = 5 + 1 - (3 + 1 - 3) = 5$ and $\top_3 =
5 + 1 + (3 + 1 - 3) = 7$. We select substrings "ar", "re", and
"es" for the third segment. For the fourth segment, we have
$\bot_4 = 7 + 1 - (3 + 1 - 4) = 8$ and $\top_4 = 7 + 1 + (3 + 1 - 4) = 8$.
Thus we select the substring "sha" for the fourth segment.
The multi-match-aware method only selects 8 substrings.
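The combination of the left-side and right-side bounds can be sketched as follows. This is an illustrative sketch under the even-partition scheme, not the authors' code; the function name `multi_match_aware_selection` is hypothetical and positions are 1-based as in the text.

```python
def multi_match_aware_selection(s: str, l: int, tau: int) -> list[list[str]]:
    """Substrings of s selected for the indices of strings of length l,
    intersecting the left-side and right-side multi-match-aware ranges."""
    delta = len(s) - l
    if l < tau + 1 or not 0 <= delta <= tau:
        raise ValueError("expected tau + 1 <= l and |s| - tau <= l <= |s|")
    base, k = divmod(l, tau + 1)
    seg_lens = [base] * (tau + 1 - k) + [base + 1] * k  # even partition
    selected = []
    p_i = 1  # 1-based start position of segment i
    for i, l_i in enumerate(seg_lens, start=1):
        # Left-side perspective: shift of at most i - 1.
        left_lo, left_hi = p_i - (i - 1), p_i + (i - 1)
        # Right-side perspective: shift of at most tau + 1 - i around p_i + delta.
        right_lo = p_i + delta - (tau + 1 - i)
        right_hi = p_i + delta + (tau + 1 - i)
        lo = max(1, left_lo, right_lo)
        hi = min(len(s) - l_i + 1, left_hi, right_hi)
        selected.append([s[p - 1:p - 1 + l_i] for p in range(lo, hi + 1)])
        p_i += l_i
    return selected


if __name__ == "__main__":
    chosen = multi_match_aware_selection("avataresha", 9, 3)
    print(chosen, sum(len(x) for x in chosen))
```

For the running example the sketch selects 8 substrings, which agrees with the count $\lfloor (3^2 - 1^2)/2 \rfloor + 3 + 1 = 8$ given by Lemma 2.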

4.3 Comparison of Selection Methods 

We compare the selected substring sets of different methods.
Let $\mathcal{W}_\ell(s, l)$, $\mathcal{W}_f(s, l)$, $\mathcal{W}_p(s, l)$, and $\mathcal{W}_m(s, l)$ respectively
denote the sets of selected substrings that use the length-based
selection method, the shift-based selection method,
the position-aware selection method, and the multi-match-aware
selection method. Based on the size analysis of each set, we have
$|\mathcal{W}_m(s, l)| \le |\mathcal{W}_p(s, l)| \le |\mathcal{W}_f(s, l)| \le |\mathcal{W}_\ell(s, l)|$.
Next we prove $\mathcal{W}_m(s, l) \subseteq \mathcal{W}_p(s, l) \subseteq \mathcal{W}_f(s, l) \subseteq \mathcal{W}_\ell(s, l)$, as
formalized in Lemma 3.

Lemma 3. For any string $s$ and a length $l$, we have
$\mathcal{W}_m(s, l) \subseteq \mathcal{W}_p(s, l) \subseteq \mathcal{W}_f(s, l) \subseteq \mathcal{W}_\ell(s, l)$.

Moreover, we can prove that $\mathcal{W}_m(s, l)$ has the minimum
size among all substring sets generated by the methods that
satisfy completeness, as formalized in Theorem 3.

Theorem 3. The substring set $\mathcal{W}_m(s, l)$ generated by the
multi-match-aware selection method has the minimum size
among all the substring sets generated by the substring
selection methods that satisfy completeness.

Theorem 3 proves that the substring set $\mathcal{W}_m(s, l)$ has the
minimum size. Next we introduce another concept to show
the superiority of our multi-match-aware selection method.

Definition 3 (Minimality). A substring set $\mathcal{W}(s, l)$
generated by a method with the completeness property satisfies
minimality if, for any substring set $\mathcal{W}'(s, l)$ generated by
a method with the completeness property, $\mathcal{W}(s, l) \subseteq \mathcal{W}'(s, l)$.

Next we prove that if $l \ge 2(\tau + 1)$ and $|s| \ge l$, the substring
set $\mathcal{W}_m(s, l)$ generated by our multi-match-aware selection
method satisfies minimality, as stated in Theorem 4. The
condition $l \ge 2(\tau + 1)$ is reasonable, as it requires each segment
to have at least two characters. For example, if
$10 \le l < 12$, we can tolerate $\tau = 4$ edit operations. If
$12 \le l < 14$, we can tolerate $\tau = 5$ edit operations.

Theorem 4. If $l \ge 2(\tau + 1)$ and $|s| \ge l$, $\mathcal{W}_m(s, l)$ satisfies
minimality.
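As a quick check of the condition $l \ge 2(\tau + 1)$, the largest threshold for which every segment still has at least two characters is $\lfloor l/2 \rfloor - 1$. The helper name `max_threshold` below is illustrative only.

```python
def max_threshold(l):
    """Largest tau with l >= 2 * (tau + 1), i.e. every segment has >= 2 characters."""
    return l // 2 - 1


if __name__ == "__main__":
    for l in (10, 11, 12, 13):
        print(l, max_threshold(l))
```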

4.4 Substring-selection Algorithm 

Based on the above discussion, we improve the SubstringSelection
algorithm by avoiding unnecessary substrings. For $\mathcal{L}_l^i$,
we use the multi-match-aware selection method to select
substrings, and the selection complexity is $O(\tau)$. Figure 6
gives the pseudo-code of the selection algorithm.

For example, consider the strings in Table 1. We create
inverted indices as illustrated in Figure 1. Consider string
$s_1$ = "vankatesh" with four segments: we build four inverted
lists for its segments {va, nk, at, esh}. Then for $s_2$ =
"avataresha", we use the multi-match-aware selection method
to select its substrings. Here we only select 8 substrings for
$s_2$ and use the 8 substrings to find similar strings of $s_2$ from
the inverted indices. Similarly, we can select substrings and
find similar string pairs for other strings.

Algorithm 2: SubstringSelection($s$, $\mathcal{L}_l^i$)

Input: $s$: A string; $\mathcal{L}_l^i$: Inverted index
Output: $\mathcal{W}(s, \mathcal{L}_l^i)$: Selected substrings

begin
    for $p \in [\bot_i, \top_i]$ do
        Add the substring of $s$ with start position $p$ and
        with length $l_i$ ($s[p, l_i]$) into $\mathcal{W}(s, \mathcal{L}_l^i)$;
end

Figure 6: SubstringSelection algorithm
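A minimal sketch of how the selected substrings could be used to probe the inverted indices is given next. It is not the authors' implementation: the even-partition layout, the dictionary-based index keyed by (length, segment number, segment text), and the names `segments`, `build_index`, and `candidates` are all assumptions made for illustration. Strings are assumed to have at least $\tau + 1$ characters.

```python
from collections import defaultdict


def segments(r, tau):
    """Yield (i, p_i, segment) for r; assumes even partition, longer segments last."""
    base, extra = divmod(len(r), tau + 1)
    pos = 1
    for i in range(1, tau + 2):
        seg_len = base + (1 if i > tau + 1 - extra else 0)
        yield i, pos, r[pos - 1:pos - 1 + seg_len]
        pos += seg_len


def build_index(strings, tau):
    """Build inverted lists keyed by (string length, segment number, segment)."""
    index = defaultdict(list)
    layout = {}                      # length -> [(i, p_i, l_i)]
    for r in strings:
        segs = list(segments(r, tau))
        layout[len(r)] = [(i, p, len(seg)) for i, p, seg in segs]
        for i, _, seg in segs:
            index[(len(r), i, seg)].append(r)
    return index, layout


def candidates(s, index, layout, tau):
    """Return indexed strings sharing a segment with a selected substring of s."""
    found = set()
    for length in range(max(1, len(s) - tau), len(s) + 1):
        delta = len(s) - length
        for i, p_i, l_i in layout.get(length, []):
            lo = max(1, p_i - (i - 1), p_i + delta - (tau + 1 - i))
            hi = min(len(s) - l_i + 1, p_i + (i - 1), p_i + delta + (tau + 1 - i))
            for p in range(lo, hi + 1):          # p in [bot_i, top_i]
                w = s[p - 1:p - 1 + l_i]
                found.update(index.get((length, i, w), []))
    return found


if __name__ == "__main__":
    idx, lay = build_index(["vankatesh", "kaushik chakrab"], 3)
    print(candidates("avataresha", idx, lay, 3))
    print(candidates("caushik chakrabar", idx, lay, 3))
```

Candidates returned this way still have to be verified, since sharing a segment is necessary but not sufficient for similarity.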

5. IMPROVING THE VERIFICATION 

In our framework, for string $s$ and inverted index $\mathcal{L}_l^i$, we
generate a set of its substrings $\mathcal{W}(s, \mathcal{L}_l^i)$. For each substring
$w \in \mathcal{W}(s, \mathcal{L}_l^i)$, we need to check whether it appears in $\mathcal{L}_l^i$. If
$w \in \mathcal{L}_l^i$, for each string $r \in \mathcal{L}_l^i(w)$, $(r, s)$ is a candidate pair
and we need to verify the candidate pair. In this section we
propose effective techniques for efficient verification.

5.1 Length-aware Verification 

Given a candidate pair $(r, s)$, a straightforward method to
verify the pair is to use a dynamic-programming algorithm
to compute their real edit distance. If the edit distance is
not larger than $\tau$, the pair is an answer. We can use a matrix
$M$ with $|r| + 1$ rows and $|s| + 1$ columns to compute their
edit distance, in which $M(0, j) = j$ for $0 \le j \le |s|$,
$M(i, 0) = i$ for $0 \le i \le |r|$, and for
$1 \le i \le |r|$ and $1 \le j \le |s|$,

$M(i, j) = \min(M(i-1, j) + 1,\ M(i, j-1) + 1,\ M(i-1, j-1) + \delta)$

where $\delta = 0$ if the $i$-th character of $r$ is the same as the $j$-th
character of $s$; otherwise $\delta = 1$. The time complexity of the
dynamic-programming algorithm is $O(|r| \cdot |s|)$.
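The recurrence translates directly into code. The following is a plain, unoptimized sketch of the full-matrix computation; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(r, s):
    """Full dynamic-programming edit distance, O(|r| * |s|) time and space."""
    rows, cols = len(r) + 1, len(s) + 1
    M = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        M[0][j] = j                       # M(0, j) = j
    for i in range(1, rows):
        M[i][0] = i                       # M(i, 0) = i
        for j in range(1, cols):
            delta = 0 if r[i - 1] == s[j - 1] else 1
            M[i][j] = min(M[i - 1][j] + 1,        # deletion
                          M[i][j - 1] + 1,        # insertion
                          M[i - 1][j - 1] + delta)  # match or substitution
    return M[-1][-1]


if __name__ == "__main__":
    print(edit_distance("vldb", "pvldb"))
```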

Actually, we do not need to compute their real edit distance
and only need to check whether their edit distance is
not larger than $\tau$. An improvement based on length pruning [20]
has been proposed which only computes the values $M(i, j)$
for $|i - j| \le \tau$, as shown in the shaded cells of Figure 7(a).
The basic idea is that if $|i - j| > \tau$, then $M(i, j) > \tau$, and we do
not need to compute such values. This method improves the
time complexity $V(s, r)$ to $O((2\tau + 1) \cdot \min(|r|, |s|))$. Next,
we propose a technique to further improve the performance
by considering the length difference between $r$ and $s$.

We first use an example to illustrate our idea. Consider
string $r$ = "kaushuk chadhui" and string $s$ = "caushik
chakrabar". Suppose $\tau = 3$. Existing methods need to
compute all the shaded values in Figure 7(a). We observe
that we do not need to compute $M(2, 1)$, which
is the edit distance between "ka" and "c". This is because
if there is a transformation from $r$ to $s$ by first transforming
"ka" to "c" with at least 1 edit operation (length difference)
and then transforming "ushuk chadhui" to "aushik
chakrabar" with at least 3 edit operations (length difference),
the transformation distance is at least 4, which is
larger than $\tau = 3$. In other words, even if we do not compute
$M(2, 1)$, we know that there is no transformation including
$M(2, 1)$ (the transformation from "ka" to "c") whose edit
distance is not larger than $\tau$. Actually we only need to
compute the highlighted values as illustrated in Figure 7(b).

To address this problem, we propose a length-aware
verification method. Without loss of generality, let $|s| \ge |r|$
and $\Delta = |s| - |r| \le \tau$ (otherwise their edit distance must
be larger than $\tau$). We say a transformation from $r$ to $s$
includes $M(i, j)$ if the transformation first transforms the
first $i$ characters of $r$ to the first $j$ characters of $s$ with
$d_1$ edit operations and then transforms the other characters
in $r$ to the other characters in $s$ with $d_2$ edit operations.
Based on length difference, we have $d_1 \ge |i - j|$ and
$d_2 \ge |(|s| - j) - (|r| - i)| = |\Delta + (i - j)|$. If $d_1 + d_2 > \tau$, we
do not need to compute $M(i, j)$, since the distance of any
transformation including $M(i, j)$ is larger than $\tau$. To check
whether $d_1 + d_2 > \tau$, we consider the following cases.

Figure 7: An example for verification



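(1) If $i \ge j$, we have $d_1 + d_2 \ge (i - j) + \Delta + (i - j)$. If
$(i - j) + \Delta + (i - j) > \tau$, that is $j < i - \frac{\tau - \Delta}{2}$, we do not need to
compute $M(i, j)$. In other words, we only need to compute
$M(i, j)$ with $j \ge i - \frac{\tau - \Delta}{2}$.

(2) If $i < j$, we have $d_1 \ge j - i$. If $j - i \le \Delta$, then
$d_1 + d_2 \ge (j - i) + \Delta - (j - i) = \Delta$. As $\Delta \le \tau$, there is no position
constraint and we need to compute $M(i, j)$. Otherwise, if $j - i > \Delta$, we have
$d_1 + d_2 \ge (j - i) + (j - i) - \Delta$. If $(j - i) + (j - i) - \Delta > \tau$, that is
$j > i + \frac{\tau + \Delta}{2}$, we do not need to compute $M(i, j)$. In other
words, we only need to compute $M(i, j)$ with $j \le i + \frac{\tau + \Delta}{2}$.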



Based on this observation, for each row $M(i, *)$, we only
compute $M(i, j)$ for $i - \lfloor \frac{\tau - \Delta}{2} \rfloor \le j \le i + \lfloor \frac{\tau + \Delta}{2} \rfloor$. For example,
in Figure 8, we only need to compute the values in black
circles. Thus we can improve the time complexity $V(s, r)$
from $O((2\tau + 1) \cdot \min(|r|, |s|))$ to $O((\tau + 1) \cdot \min(|r|, |s|))$.

Figure 8: Length-aware verification



Early Termination: We can further improve the performance
by terminating early. Consider the values
in row $M(i, *)$. A straightforward early-termination method
is to check each value in $M(i, *)$: if every value is larger
than $\tau$, we can terminate. This is because
the values in the following rows $M(k, *)$ with $k > i$ must also be larger
than $\tau$, based on the dynamic-programming algorithm. This
pruning technique is called prefix pruning. For example, in
Figure 7(a), if $\tau = 3$, after we have computed $M(13, *)$, we
can terminate early as all the values in $M(13, *)$
are larger than $\tau$. But in our method, after we have computed
the values in $M(6, *)$, we can conclude that the edit
distance between the two strings is at least 4 (larger than
$\tau = 3$). Thus we do not need to compute $M(i, *)$ for $i > 6$ and
can terminate the computation, as shown in Figure 7(b). To
this end, we propose a novel early-termination method.

For ease of presentation, we first introduce some notation.
Given a string $s$, let $s[i]$ denote the $i$-th character
and $s[i : j]$ denote the substring of $s$ from the $i$-th character
to the $j$-th character. Notice that $M(i, j)$ denotes the
edit distance between $r[1 : i]$ and $s[1 : j]$. We can estimate
a lower bound of the edit distance between the remaining parts
$r[i+1 : |r|]$ and $s[j+1 : |s|]$ using their length difference
$|(|s| - j) - (|r| - i)|$. We
use $E(i, j) = M(i, j) + |(|s| - j) - (|r| - i)|$ to estimate the
edit distance between $s$ and $r$, which is called the expected edit
distance of $s$ and $r$ with respect to $M(i, j)$. If each expected
edit distance for $M(i, j)$ in $M(i, *)$ is larger than $\tau$, the edit
distance between $r$ and $s$ must be larger than $\tau$, thus we
can terminate early. To achieve our goal, for each
value $M(i, j)$, we maintain the expected minimal edit distance
$E(i, j)$. If each value in $E(i, *)$ is larger than $\tau$, we
can terminate early, as formalized in Lemma 4.

Lemma 4. Given strings $s$ and $r$, if each value in $E(i, *)$
is larger than $\tau$, the edit distance of $r$ and $s$ is larger than $\tau$.

For example, in Figure 7(b), we show the expected edit
distances in the left-bottom corner of each cell. When we
have computed $M(6, *)$ and $E(6, *)$, all values in $E(6, *)$ are
larger than 3, thus we can terminate early. In this
way, we can avoid many unnecessary computations. Note
that our proposed verification techniques can be applied to
any other algorithm that verifies a candidate pair in terms of
edit distance (e.g., ED-Join and NGPP).
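The two ideas of this subsection, the narrowed band $i - \lfloor (\tau - \Delta)/2 \rfloor \le j \le i + \lfloor (\tau + \Delta)/2 \rfloor$ and early termination on the expected edit distance $E(i, j)$, can be combined as in the sketch below. It is an illustrative reading of the description rather than the authors' code; the name `length_aware_verify` and the use of per-row dictionaries are assumptions. Cells outside the band are treated as larger than $\tau$.

```python
def length_aware_verify(r, s, tau):
    """Return min(tau + 1, ED(r, s)) using the length-aware band and E(i, j)."""
    if len(r) > len(s):
        r, s = s, r                       # ensure |s| >= |r|
    delta = len(s) - len(r)
    if delta > tau:
        return tau + 1
    left, right = (tau - delta) // 2, (tau + delta) // 2
    big = tau + 1                         # stands for "larger than tau"
    prev = {j: j for j in range(0, min(len(s), right) + 1)}   # row M(0, *)
    for i in range(1, len(r) + 1):
        cur = {}
        for j in range(max(0, i - left), min(len(s), i + right) + 1):
            if j == 0:
                best = i
            else:
                cost = 0 if r[i - 1] == s[j - 1] else 1
                best = min(prev.get(j, big) + 1,
                           cur.get(j - 1, big) + 1,
                           prev.get(j - 1, big) + cost)
            cur[j] = min(best, big)
        # Early termination: every expected edit distance E(i, j) exceeds tau.
        if all(m + abs((len(s) - j) - (len(r) - i)) > tau for j, m in cur.items()):
            return tau + 1
        prev = cur
    return min(prev.get(len(s), big), tau + 1)


if __name__ == "__main__":
    print(length_aware_verify("kaushuk chadhui", "caushik chakrabar", 3))
```

Returning $\tau + 1$ for every dissimilar pair keeps the interface identical to a threshold test while never computing distances beyond the threshold.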

5.2 Extension-based Verification 

Consider a selected substring $w$ of string $s$. If $w$ appears
in the inverted index $\mathcal{L}_l^i$, for each string $r$ in the inverted list
$\mathcal{L}_l^i(w)$, we need to verify the pair $(s, r)$. As $s$ and $r$ share
a common segment $w$, we can use the shared segment to
efficiently verify the pair. To achieve our goal, we propose
an extension-based verification algorithm.

As $r$ and $s$ share a common segment $w$, we partition them
into three parts based on the common segment. We partition
$r$ into three parts: the left part $r_l$, the matching part $r_m = w$,
and the right part $r_r$. Similarly, we get three parts for string
$s$: $s_l$, $s_m = w$, and $s_r$. Here we align $s$ and $r$ based on the
matching substrings $r_m$ and $s_m$, and we only need to verify
whether $r$ and $s$ are similar in this alignment. Thus we first
compute the edit distance $d_l = \mathrm{ED}(r_l, s_l)$ between $r_l$ and
$s_l$ using the above-mentioned method. If $d_l$ is larger than
$\tau$, we terminate the computation; otherwise, we compute
the edit distance $d_r = \mathrm{ED}(s_r, r_r)$ between $s_r$ and $r_r$. If
$d_l + d_r$ is larger than $\tau$, we discard the pair; otherwise we
take it as an answer. Note that this method will not involve
false negatives. This is because, based on Lemma 1, if $s$
and $r$ are similar, $s$ must have a substring that matches a
segment of $r$. In addition, based on the dynamic-programming
algorithm, there must exist a transformation that aligns $r_m$
with $s_m$ and for which $\mathrm{ED}(s, r) = d_l + d_r$. As our method selects all
possible substrings and considers all such common segments,
our method will not miss any results. On the other hand, as
we find the answers with $d_l + d_r \le \tau$ and $\mathrm{ED}(s, r) \le d_l + d_r \le \tau$,
our method will not involve false positives. To guarantee
correctness of our extension-based method, we first give a
formal definition of correctness.



Figure 9: Extension-based verification (thresholds shown in the figure: $\tau_l = \min(\tau - ||r_r| - |s_r||,\ i - 1)$ and $\tau_r = \min(\tau - d_l,\ \tau + 1 - i)$)

Definition 4 (Correctness). Given a candidate pair
$(s, r)$, a verification algorithm is correct if it satisfies: (1) if
$(s, r)$ passes the algorithm, $(s, r)$ must be a similar pair; and
(2) if $(s, r)$ is a similar pair, it must pass the algorithm.

We prove that our extension-based verification method
satisfies correctness, as stated in Theorem 5.

Theorem 5. Our extension-based verification method
satisfies correctness.

Actually, we can further improve the verification algorithm.
For the left parts, we can give a tighter threshold
$\tau_l \le \tau$. The basic idea is as follows. The minimal edit distance
between the two right parts $r_r$ and $s_r$ is $||r_r| - |s_r||$.
Thus we can set $\tau_l = \tau - ||r_r| - |s_r||$. If the edit distance
between $r_l$ and $s_l$ is larger than threshold $\tau_l$, we can
terminate the verification; otherwise we continue to compute
$d_r = \mathrm{ED}(r_r, s_r)$. Similarly, for the right parts, we can also
give a tighter threshold $\tau_r \le \tau$. As $d_l$ has been computed,
we can set $\tau_r = \tau - d_l$ as a threshold to verify whether $r_r$
and $s_r$ are similar. If $d_r$ is larger than threshold $\tau_r$, we can
terminate the verification.

For example, suppose we want to verify $s_5$ = "kaushuk chadhui"
and $s_6$ = "caushik chakrabar". $s_5$ and $s_6$ share a segment
" cha". We have $s_{5,l}$ = "kaushuk" and $s_{6,l}$ = "caushik",
and $s_{5,r}$ = "dhui" and $s_{6,r}$ = "krabar". Suppose $\tau = 3$. As
$||s_{5,r}| - |s_{6,r}|| = 2$, $\tau_l = \tau - 2 = 1$. We only need to verify
whether the edit distance between $s_{5,l}$ and $s_{6,l}$ is not larger
than $\tau_l = 1$. After we have computed $M(6, *)$, we can
terminate early as each value in $E(6, *)$ is larger than 1, as
shown in Figure 7. Note that as $\tau_l = 1$ and $|s_{5,l}| - |s_{6,l}| = 0$,
$\lfloor \frac{\tau_l - \Delta}{2} \rfloor = \lfloor \frac{\tau_l + \Delta}{2} \rfloor = 0$. Thus we only need to compute $M(i, i)$.

We now discuss how to deduce tighter bounds for $\tau_l$ and $\tau_r$.
Consider the $i$-th segment. If $d_l \ge i$, we can terminate the
verification based on the multi-match-aware method. Thus
we have $\tau_l = i - 1$. Combining this with the above pruning
condition, we have $\tau_l = \min(\tau - ||r_r| - |s_r||,\ i - 1)$. As
$||r_r| - |s_r|| = |(|r| - p_i - l_i) - (|s| - p - l_i)| = |p - p_i - \Delta| \le \tau + 1 - i$ (based
on the multi-match-aware method), $\tau - ||r_r| - |s_r|| \ge i - 1$.
We set $\tau_l = i - 1$. Similarly, we have $\tau_r = \min(\tau - d_l,\ \tau + 1 - i)$.
As $d_l \le \tau_l \le i - 1$, $\tau - d_l \ge \tau - (i - 1)$. Thus we set $\tau_r = \tau + 1 - i$.
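The extension-based check with the two thresholds can be sketched as follows. The sketch uses the general forms $\tau_l = \min(\tau - ||r_r| - |s_r||,\ i - 1)$ and $\tau_r = \min(\tau - d_l,\ \tau + 1 - i)$, so it also works for a substring that was not chosen by the multi-match-aware method. The helper `bounded_ed` is a plain dynamic program standing in for the length-aware routine, and all names and the argument layout (1-based positions $p_i$ in $r$ and $p$ in $s$, segment length $l_i$) are assumptions for illustration.

```python
def bounded_ed(a, b, limit):
    """Return min(limit + 1, ED(a, b)); a simple stand-in for length-aware verification."""
    if limit < 0 or abs(len(a) - len(b)) > limit:
        return limit + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return min(prev[-1], limit + 1)


def extension_verify(r, s, i, p_i, p, l_i, tau):
    """Verify (r, s) given that the i-th segment r[p_i, l_i] equals s[p, l_i]."""
    r_l, r_r = r[:p_i - 1], r[p_i - 1 + l_i:]
    s_l, s_r = s[:p - 1], s[p - 1 + l_i:]
    tau_l = min(tau - abs(len(r_r) - len(s_r)), i - 1)
    d_l = bounded_ed(r_l, s_l, tau_l)
    if d_l > tau_l:
        return False                      # left parts already too far apart
    tau_r = min(tau - d_l, tau + 1 - i)
    return bounded_ed(r_r, s_r, tau_r) <= tau_r


if __name__ == "__main__":
    # The pair from the example: shared segment " cha" is the third segment of s_5.
    print(extension_verify("kaushuk chadhui", "caushik chakrabar", 3, 8, 8, 4, 3))
```

For the pair in the example the left threshold is 1 while the left parts differ by 2 edits, so the sketch rejects the pair without touching the right parts.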

5.3 Sharing Computations

Given a selected substring $w$, there may be large numbers
of strings in $\mathcal{L}_l^i(w)$ similar to string $s$. When computing the
edit distance of the left parts $s_l$ and $r_l$ (and that of the right
parts $s_r$ and $r_r$), we can share the computations if they have
common prefixes. Next we discuss how to share computations.
As the strings in $\mathcal{L}_l^i(w)$ are sorted in alphabetical
order, we visit strings in $\mathcal{L}_l^i(w)$ in order. Suppose the first
string is $r_1$ and its three parts are $r_{1,l}$, $r_{1,m}$, $r_{1,r}$. We compute
the edit distance between $r_{1,l}$ and $s_l$ using the dynamic-programming
algorithm. We store the matrix for $r_{1,l}$ and $s_l$.
For the next string $r_2$ with left part $r_{2,l}$, we use the
stored matrix to compute the edit distance between $r_{2,l}$ and
$s_l$. We first compute the longest common prefix between $r_{2,l}$
and $r_{1,l}$, denoted by $c$. When computing the edit distance
between $s_l$ and $r_{2,l}$, we reuse the part of the stored matrix for $s_l$ and $c$,
which has already been computed for $s_l$ and $r_{1,l}$. Then for
the characters after $c$ in $r_{2,l}$, we continue the computation
using the kept matrix. Thus we avoid many unnecessary
computations. Notice that we do not need to maintain multiple
matrices and only keep a single matrix for the current
string. We use the same idea on the right parts ($s_r$, $r_r$).

5.4 Verification Algorithm

Based on our proposed techniques, we improve the Verification
algorithm. Consider a string $s$, a selected substring $w$,
and an inverted list $\mathcal{L}_l^i(w)$. For $r \in \mathcal{L}_l^i(w)$, we
use the extension-based method to verify the candidate pair
$(s, r)$ as follows. We first compute $\tau_l = i - 1$ and $\tau_r = \tau + 1 - i$.
Then for each $r \in \mathcal{L}_l^i(w)$, we compute the edit distance ($d_l$)
between $r_l$ and $s_l$ using the tighter bound $\tau_l$. If $d_l > \tau_l$, we
terminate the verification; otherwise we verify whether $s_r$
and $r_r$ are similar with threshold $\tau_r$. When computing the
edit distance between $s_l$ and $r_l$ ($s_r$ and $r_r$), we use the
length-aware verification and share the computations on common
prefixes. Figure 10 illustrates the pseudo-code.

Algorithm 3: Verification($s$, $\mathcal{L}_l^i(w)$, $\tau$)

Input: $s$: A string; $\mathcal{L}_l^i(w)$: Inverted list; $\tau$: Threshold
Output: $\mathcal{R} = \{(s \in \mathcal{S}, r \in \mathcal{S}) \mid \mathrm{ED}(s, r) \le \tau\}$

begin
    $\tau_l = i - 1$;
    $\tau_r = \tau + 1 - i$;
    for $r \in \mathcal{L}_l^i(w)$ do
        $d_l$ = VerifyStringPair($s_l$, $r_l$, $\tau_l$);
        if $d_l \le \tau_l$ then
            $d_r$ = VerifyStringPair($s_r$, $r_r$, $\tau_r$);
            if $d_r \le \tau_r$ then $\mathcal{R} \leftarrow (r, s)$;
end

Function VerifyStringPair($s$, $r$, $\tau'$)

Input: $s$: A string; $r$: A string; $\tau'$: A threshold
Output: $d = \min(\tau' + 1, \mathrm{ED}(s, r))$

begin
    Use the length-aware verification with the threshold $\tau'$,
    sharing the computations on common prefixes;
    if early termination then $d = \tau' + 1$;
    else $d = \mathrm{ED}(s, r)$;
end

Figure 10: Verification algorithm

5.5 Correctness and Completeness

We prove correctness and completeness of our algorithm,
as formalized in Theorem 6.

Theorem 6. Our algorithm satisfies (1) completeness: given
any similar pair $(s, r)$, our algorithm must find it as an answer;
and (2) correctness: a pair $(s, r)$ found by our algorithm
must be a similar pair.

6. EXPERIMENTAL STUDY

We have implemented our method and conducted an extensive
set of experimental studies on three real datasets:
DBLP Author¹, DBLP Author+Title, and AOL Query Log².
DBLP Author is a dataset with short strings, DBLP Author+Title
is a dataset with long strings, and the Query
Log is a set of query logs. Table 2 shows the detailed information
of the datasets. Note that the Author+Title dataset
is the same as that used in ED-Join and the Author dataset
is the same as that used in Trie-Join. Figure 11 shows the
string length distributions of the different datasets.

Table 2: Datasets

Datasets        Cardinality   Avg Len   Max Len   Min Len
Author          612781        14.826    46        6
Query Log       464189        44.75     522       30
Author+Title    863073        105.82    886       21



We compared our algorithms with state-of-the-art methods,
ED-Join [23] and Trie-Join [20]. As ED-Join and
Trie-Join outperform other methods, e.g., Part-Enum [2]
and All-Pairs-Ed [3] (also experimentally shown in [23, 20]),
in this paper we only compared our method with these two
methods. We downloaded their binary codes from their
homepages, ED-Join³ and Trie-Join⁴.

All the algorithms were implemented in C++ and compiled
using GCC 4.2.4 with the -O3 flag. All the experiments
were run on an Ubuntu machine with an Intel Core 2 Quad
X5450 3.00GHz processor and 4 GB memory.

6.1 Evaluating Substring Selection 

In this section, we evaluate substring selection techniques.
We implemented the following four methods. (1) The length-based
selection method, denoted by Length, which selects
the substrings with the same lengths as the segments. (2)
The shift-based method, denoted by Shift, which selects the
substrings by shifting $[-\tau, \tau]$ positions as discussed in Section 4.
(3) Our position-aware selection method, denoted
by Position. (4) Our multi-match-aware selection method,
denoted by Multi-match. We first compared the total number
of selected substrings. Figure 12 shows the results.

We can see that the Length-based method selected large
numbers of substrings. The number of selected substrings
of the Position-based method was about a tenth to a fourth
of that of the Length-based method and a half of that of the
Shift-based method. The Multi-match-based method further
reduced the number of selected substrings to about a half of
that of the Position-based method. For example, on the Author
dataset, for $\tau = 1$, the Length-based method selected
19 million substrings, the Shift-based method selected 5.5
million substrings, the Position-based method reduced the
number to 3.7 million, and the Multi-match-based method
further decreased the number to 2.4 million. Based on our
analysis in Section 4, for strings with length $l$, the length-based
method selected $(\tau + 1)(|s| + 1) - l$ substrings, the shift-based
method selected $(\tau + 1)(2\tau + 1)$ substrings, the position-based
method selected $(\tau + 1)^2$ substrings, and the multi-match-aware
method selected $\lfloor (\tau^2 - \Delta^2)/2 \rfloor + \tau + 1$ substrings. If
$|s| = l = 15$ and $\tau = 1$, the numbers of selected substrings of the
four methods are respectively 17, 6, 4, and 2. The
experimental results are consistent with our theoretical analysis.
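The four counts quoted above follow directly from the formulas. A small sketch (the function name `selected_counts` is illustrative) evaluates them for given $|s|$, $l$, and $\tau$:

```python
def selected_counts(s_len, l, tau):
    """Number of selected substrings per method, from the formulas in Section 4."""
    delta = s_len - l
    return {
        "Length": (tau + 1) * (s_len + 1) - l,
        "Shift": (tau + 1) * (2 * tau + 1),
        "Position": (tau + 1) ** 2,
        "Multi-match": (tau * tau - delta * delta) // 2 + tau + 1,
    }


if __name__ == "__main__":
    print(selected_counts(15, 15, 1))
```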



¹ DBLP: http://www.informatik.uni-trier.de/~ley/db
² AOL Query Log: http://www.gregsadetsky.com/aol-data/
³ ED-Join: http://www.cse.unsw.edu.au/~weiw/project/simjoin.html
⁴ Trie-Join: http://dbgroup.cs.tsinghua.edu.cn/wangjn/

Figure 13: Elapsed time for generating substrings

We also compared the elapsed time to generate substrings.
Figure 13 shows the results. We see that the Multi-match-based
method outperformed the Position-based method, which
in turn was better than the Shift-based method and the
Length-based method. This is because the elapsed time
depended on the number of selected substrings, and the
Multi-match-based method selected the smallest number of substrings.

6.2 Evaluating Verification 

In this section, we evaluate our verification techniques.
We implemented four methods. (1) The naive method, denoted
by $2\tau + 1$, which computed $2\tau + 1$ values in each row
and used the naive early-termination technique (if all values
in a row are larger than $\tau$, we terminate). (2) Our length-aware
method, denoted by $\tau + 1$, which computed $\tau + 1$ values
in each row and used the expected edit distance to do early
termination. (3) Our extension-based method, denoted by
Extension, which used the extension-based framework. It
also computed $\tau + 1$ values in each row and used the expected
edit distance to do early termination. (4) The extension-based
method with sharing computations on common prefixes,
denoted by SharePrefix. Figure 14 shows the results.

We see that the naive method had the worst performance,
as it needed to compute many unnecessary values in the matrix.
Our length-aware method was 2-5 times faster than the
naive method. This is because our length-aware method
decreases the number of values computed per row from $2\tau + 1$
to $\tau + 1$ and uses expected edit distances to do early termination.
The extension-based method achieved higher performance and was 2-4 times
faster than the length-aware method. The reason is that the
extension-based method can avoid the duplicated computations
on the common segments and also uses a tighter
bound to verify the left parts and the right parts. The
SharePrefix method achieved the best performance, as it can
avoid many unnecessary computations for strings with common
prefixes. For example, on the Author dataset, for $\tau = 4$,
the naive method took 10,000 seconds, the length-aware
method decreased the time to 4000 seconds, the extension-based
method reduced it to 2000 seconds, and the SharePrefix
method further improved the time to about 700 seconds. On
the Query Log dataset, for $\tau = 8$, the elapsed times of the
four methods were respectively 3500 seconds, 1500 seconds,
600 seconds, and 450 seconds.

6.3 Comparison with Existing Methods 

In this section, we compare our method with the state-of-the-art
methods ED-Join [23] and Trie-Join [20]. As Trie-Join
had multiple algorithms, we reported the best results.
For ED-Join, we tuned its parameter $q$ and reported the
best results. As Trie-Join was efficient for short strings,
we downloaded the same dataset from the Trie-Join homepage
(i.e., Author with short strings) and used it to compare with
Trie-Join. As ED-Join was efficient for long strings, we
downloaded the same dataset from the ED-Join homepage (i.e.,
Author+Title with long strings) and used it to compare with
ED-Join. Figure 15 shows the results, where the elapsed
time included the indexing time and the join time.

On the Author dataset with short strings, Trie-Join
outperformed ED-Join, and our method was much better than
them, especially for $\tau \ge 2$. The main reason is as follows.
ED-Join must use a smaller $q$ for a larger threshold. In this
way ED-Join will involve large numbers of candidate pairs,
since a smaller $q$ has rather lower pruning power [23]. Trie-Join
used prefix filtering to find similar pairs using a trie
structure. If a small number of strings shared prefixes, Trie-Join
had low pruning power and it was expensive to traverse
the trie structure. Instead, our framework utilized segments
to prune large numbers of dissimilar pairs. The segments
were selected across the strings and not restricted to prefix
filtering. For instance, for $\tau = 4$, Trie-Join took 2500 seconds.
Pass-Join improved it to 700 seconds. ED-Join was
rather slow and took more than 10,000 seconds.

Figure 14: Elapsed time for verification

On the Author+Title dataset with long strings, our method
significantly outperformed ED-Join and Trie-Join, by up to
2-3 orders of magnitude. This is because Trie-Join was
rather expensive to traverse the trie structures with long
strings, especially for large thresholds. ED-Join needed to
use a mismatch technique to do pruning, which was inefficient,
while our filtering algorithm is very efficient. In addition,
our verification method was more efficient than existing
ones. For instance, for $\tau = 8$, Trie-Join took 15,000 seconds,
ED-Join decreased it to 5000 seconds, and Pass-Join
improved the time to 130 seconds.

In addition, we compared index sizes on the three datasets,
as shown in Table 3. We can observe that existing methods
involve larger indices than our method. For example,
on the Author+Title dataset, ED-Join had a 335 MB index,
Trie-Join used 90 MB, and our method only took 2.1 MB.
There are two main reasons. Firstly, for each string with
length $l$, ED-Join generated $l - q + 1$ grams, where $q$ is the
gram length, and our method only generated $\tau + 1$ segments.
Secondly, for a string with length $l$, we only maintained the
indices for strings with lengths between $l - \tau$ and $l$, and ED-Join
kept indices for all strings. Trie-Join needed to use
a trie structure to maintain strings, which had overhead to
store the strings (e.g., pointers to children and indices for
searching children with a given character).

Table 3: Index sizes (MB)

Data Sets       Data Sizes   ED-Join (q = 4)   Trie-Join   Pass-Join (τ = 4)
Author          8.7          25.34             16.32       1.92
Query Log       20           72.17             69.65       4.96
Author+Title    88           335.24            90.17       2.1



6.4 Scalability 

In this section, we tested the scalability of our method.
We varied the number of strings in the dataset and tested
the elapsed time. Figure 16 shows the results. We can see
that our method achieved nearly linear scalability. For example,
for $\tau = 4$, on the Author dataset, the elapsed times for
400,000 strings, 500,000 strings, and 600,000 strings were
respectively 360 seconds, 530 seconds, and 700 seconds.

7. RELATED WORK 

There have been many studies on string similarity joins [7,
2, 3, 6, 18, 23, 24, 19]. The approaches most related to
ours are Trie-Join [20], All-Pairs-Ed [3], ED-Join [23], and
Part-Enum [2]. All-Pairs-Ed is a $q$-gram-based method. It
first generates $q$-grams for each string and then selects the
first $q\tau + 1$ grams as a gram prefix based on a pre-defined
order. It prunes the string pairs with no common grams and
verifies the surviving string pairs. ED-Join improves All-Pairs-Ed
using location-based and content-based mismatch
filters to decrease the number of grams. It has been shown
that ED-Join outperforms All-Pairs-Ed [3]. Trie-Join uses
a trie structure to do similarity joins using prefix filtering.
Part-Enum proposed an effective signature scheme called
Part-Enum to do similarity joins for Hamming distance. It
has been shown that All-Pairs-Ed and Part-Enum are worse
than ED-Join and Trie-Join [20]. Thus we only compared
with ED-Join and Trie-Join.

Gravano et al. [7] proposed gram-based methods and used
SQL statements for similarity joins inside relational databases.
Sarawagi et al. [18] proposed inverted index-based algorithms
to solve the similarity-join problem. Chaudhuri et al. [6]
proposed a primitive operator for effective similarity joins. Arasu
et al. [2] developed a signature scheme which can be used
as a filter for effective similarity joins. Xiao et al. [25]
proposed ppjoin to improve the all-pair algorithm by introducing
positional filtering and suffix filtering. Xiao et al. [24] studied
top-$k$ similarity joins, which can directly find the top-$k$
string pairs without a given threshold.

In addition, Jacox et al. [11] studied the metric-space
similarity join. As this method is not as efficient as ED-Join
and Trie-Join [20], we did not compare with it in the paper.
Chaudhuri et al. [6] proposed the prefix-filtering signature
scheme for effective similarity joins. Recently, Wang et
al. [21] devised a new similarity function by tolerating token
errors in token-based similarity and developed effective
algorithms to support similarity joins on such functions.

Other related studies are on approximate string searching
[5, 14, 8, 9, 26], which, given a query string and a set
of strings, finds all strings in the set that are similar to the
query string. Navarro studied the approximate string matching
problem [17], which, given a query string and a text string,
finds all substrings of the text string that are similar to
the query string. These two problems are different from our
similarity-join problem, which, given two sets of strings, finds
all similar string pairs. There are also some studies on selectivity
estimation of approximate string queries [10, 12, 13] and
approximate entity extraction [1, 4, 22, 15].

8. CONCLUSION 

In this paper, we have studied the problem of string similarity
joins with edit-distance constraints. We propose a
partition-based method to do efficient similarity joins. We
first sort strings and then visit strings in order. We build
inverted indices for the visited strings. For each string, we
select some of its substrings and utilize the selected substrings
to find similar string pairs using the inverted indices.
We develop a position-aware method and a multi-match-aware
method to select substrings. We prove that
the multi-match-aware selection method can minimize the
number of selected substrings. We also develop efficient
techniques to verify candidate pairs based on length difference.
We propose an extension-based method and share
the computations on common prefixes to further improve
the verification performance. Experiments show that our
method outperforms state-of-the-art studies on both short
strings and long strings.